Guard Windows ROCm torchao override skip by InfoSage05 · Pull Request #6837 · unslothai/unsloth

InfoSage05 · 2026-07-03T08:34:35Z

What changed

This PR hardens the Windows ROCm torchao skip path in studio/install_python_stack.py.

Previously, the installer skipped torchao only when _rocm_windows_torch_installed had been set earlier in the install flow. This change adds a direct runtime probe of the installed torch build and skips the torchao override whenever the target venv already contains a Windows ROCm torch wheel.

Why it changed

Issue #6833 reports that on Windows ROCm, torchao==0.17.0 crashes on import because the ROCm Windows torch build does not expose torch.ops._c10d_functional.all_gather_into_tensor.

The repo already had the main mitigation in place:

Studio runtime stubs torchao on Windows ROCm so transformers and peft imports do not crash.
install_python_stack.py skips installing torchao when the ROCm-install path marks _rocm_windows_torch_installed.
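
As a rough illustration of the first mitigation, a placeholder module can be registered in sys.modules so that a plain import of the package does not run its real, crashing import code. This is only a sketch under assumptions: install_stub and the package name in the usage note are hypothetical, and the actual Studio stub is not shown in this PR and may differ.

```python
import importlib.machinery
import sys
import types


def install_stub(name: str) -> types.ModuleType:
    """Register an empty placeholder module under `name` (hypothetical helper).

    If a module with that name is already imported, it is returned untouched.
    """
    existing = sys.modules.get(name)
    if existing is not None:
        return existing
    stub = types.ModuleType(name)
    # Give the stub a spec so importlib.util.find_spec(name) does not raise.
    stub.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)
    sys.modules[name] = stub
    return stub
```

After install_stub("some_package"), a later `import some_package` resolves to the placeholder. Submodule imports would still fail with this minimal version, so a real stub would likely need to cover more surface.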

The remaining gap was that the skip depended on that earlier flag being present. If the flag was missed or stale but the environment already contained a ROCm torch wheel, the installer could still install torchao and ship a package that crashes on import.

Root cause

The issue is neither a general ROCm failure nor a GPU detection failure. The root cause is that Windows ROCm torch can be present in the venv while the torchao override decision still relies on installer state rather than the actual installed torch runtime.

This PR closes that gap by checking the installed torch runtime directly before applying the torchao override.
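
A minimal sketch of what such a runtime probe could look like, assuming the installer runs a short snippet with the target venv's interpreter and reads the last non-empty line of its output. The helper names, the ROCM_PROBE= sentinel, and the timeout are illustrative assumptions, not the code merged in this PR; torch.version.hip is the standard attribute that is a version string on ROCm builds and None otherwise.

```python
import subprocess
import sys

# Hypothetical snippet executed inside the target venv. The sentinel prefix
# makes the result line distinguishable from unrelated stdout noise.
_PROBE_SNIPPET = (
    "import torch; "
    "print('ROCM_PROBE=' + str(getattr(torch.version, 'hip', None) or ''))"
)
_PREFIX = "ROCM_PROBE="


def parse_probe_output(stdout: str) -> bool:
    """Return True when the last non-empty line reports a HIP version."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        return False
    last = lines[-1]
    return last.startswith(_PREFIX) and len(last) > len(_PREFIX)


def venv_torch_is_windows_rocm(python_exe: str, platform: str = sys.platform) -> bool:
    """Probe the torch build installed for `python_exe` (illustrative only)."""
    if platform != "win32":
        return False
    try:
        result = subprocess.run(
            [python_exe, "-c", _PROBE_SNIPPET],
            capture_output=True,
            text=True,
            timeout=120,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Treat an unrunnable interpreter as "not ROCm" in this sketch.
        return False
    return result.returncode == 0 and parse_probe_output(result.stdout)
```

Running the probe in a subprocess keeps the installer process from importing torch itself, and a failed or missing torch import simply yields False.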

User impact

Windows ROCm installs no longer depend solely on _rocm_windows_torch_installed to avoid installing incompatible torchao.
If the venv already has a ROCm torch build, the installer now skips torchao consistently.
This reduces the chance of Studio falling into the import-crash path described in #6833.
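
The resulting decision can be sketched as a simple "flag or probe" check. The function name and the choice to swallow probe errors are assumptions made for illustration; only the two inputs (the _rocm_windows_torch_installed marker and a runtime probe) come from the PR description.

```python
from typing import Callable


def should_skip_torchao(rocm_flag_set: bool, probe: Callable[[], bool]) -> bool:
    """Skip torchao if the install marker is set or the runtime probe says ROCm."""
    if rocm_flag_set:
        # The earlier install step already recorded a Windows ROCm torch wheel,
        # so the (potentially slow) probe is not needed.
        return True
    try:
        return bool(probe())
    except Exception:
        # Assumption: a failing probe falls back to the flag-only behavior.
        return False
```

Passing the probe as a callable keeps the decision easy to test without a real torch install.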

Validation

Focused tests:

pytest -q tests/studio/install/test_rocm_support.py -k 'RocmTorchInstalledEnvVar or WindowsRocmTorchaoGuard'

Broader installer coverage:

pytest -q tests/studio/install/test_rocm_support.py tests/studio/install/test_pr5940_followups.py

Results:

Focused slice: 8 passed
Broader slice: 378 passed, 4 skipped

When doing full finetuning (FFT) of a bfloat16 model, the fp16/bf16 mismatch validation fires before the corrective logic runs, causing a misleading error even though the code would properly handle it downstream. Skip the validation when full_finetuning is active. Fixes unslothai#6731

…idation Instead of entirely skipping validation (which could let mismatches through when mixed_precision_dtype is float32), auto-correct explicit fp16/bf16 settings that conflict with the model's dtype for FFT. This way the existing validation still catches real mismatches for non-FFT cases, and the corrective logic below handles the normalized settings. Fixes the issue raised in Codex review of PR unslothai#6813.

Detect installed ROCm torch directly before applying the torchao override so Windows ROCm environments never install the crashing torchao package even if the earlier ROCm-installed flag is missing.

gemini-code-assist

Code Review

This pull request introduces a check to detect Windows ROCm PyTorch builds to safely skip torchao installation, along with corresponding unit tests. It also updates the TRL RL trainer patching logic to automatically adjust precision flags (use_fp16/use_bf16) during full finetuning to match the model's dtype. The review feedback suggests making the ROCm torch probe more robust against stray stdout noise by checking the last non-empty line, and ensuring that the underlying args.fp16 and args.bf16 attributes are updated alongside the local variables to maintain consistency.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Tolerate stray stdout noise when probing Windows ROCm torch installs by checking the last non-empty output line, matching the existing torch version probe behavior. Also keep args.fp16 and args.bf16 synchronized with the full-finetuning precision auto-corrections in the RL trainer patch so downstream eval settings see a consistent TrainingArguments state.
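
The precision half of that commit message could look roughly like the sketch below: for full finetuning, an explicit fp16/bf16 setting that conflicts with the model's dtype is flipped, and the same values are written back to the args object so later readers see a consistent state. The function name, the string dtype argument, and the symmetric handling of float16 models are assumptions; this change was later dropped from the PR and the real unsloth/models/rl.py logic is not shown here.

```python
from types import SimpleNamespace


def normalize_fft_precision(args, model_dtype: str, full_finetuning: bool):
    """Return (use_fp16, use_bf16), keeping args.fp16 / args.bf16 in sync."""
    use_fp16, use_bf16 = bool(args.fp16), bool(args.bf16)
    if full_finetuning:
        if model_dtype == "bfloat16" and use_fp16:
            use_fp16, use_bf16 = False, True
        elif model_dtype == "float16" and use_bf16:
            use_fp16, use_bf16 = True, False
    # Write back so downstream code reading args sees the corrected flags.
    args.fp16, args.bf16 = use_fp16, use_bf16
    return use_fp16, use_bf16


# Stand-in for a TrainingArguments-like object.
example_args = SimpleNamespace(fp16=True, bf16=False)
flags = normalize_fft_precision(example_args, "bfloat16", full_finetuning=True)
```

Outside full finetuning the flags pass through unchanged, so the existing mismatch validation still applies to those cases.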

for more information, see https://pre-commit.ci

Imagineer99 · 2026-07-03T11:46:31Z

Main installer change looks correct.

I have one question: why is the unsloth/models/rl.py precision flag change included? The PR description is focused on Windows ROCm/torchao, so was including it intentional?

Could we also make the new guard test a bit more behavioral rather than checking the exact source string? For example, patch _installed_torch_is_windows_rocm() to return true and assert that the torchao install path is skipped. I think that would cover the intent more directly and be less brittle.

InfoSage05 · 2026-07-03T12:06:25Z

@Imagineer99 Yes, the unsloth/models/rl.py change was intentional, but it is unrelated to the Windows ROCm / torchao guard. It came from separate review feedback on the same branch, so I agree it makes this PR less focused. I’ll remove that RL precision-flag change from this PR and keep this branch scoped to the Windows ROCm installer behavior only.

Good point on the test as well. I’ll switch the new guard test to a behavioral one by patching _installed_torch_is_windows_rocm() to return True and asserting that the torchao install path is skipped, instead of asserting on the exact source string. That should better cover the intended behavior and avoid brittle source-level coupling.
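
A behavioral test along those lines might be shaped like this sketch. The `installer` namespace and `install_overrides` are stand-ins invented for the example (the real installer's structure is not shown in this thread); only the idea of patching _installed_torch_is_windows_rocm() to return True and asserting the skip comes from the discussion.

```python
import types
from unittest import mock

# Stand-in for the installer module (hypothetical layout).
installer = types.SimpleNamespace(
    _rocm_windows_torch_installed=False,
    _installed_torch_is_windows_rocm=lambda: False,
    installed=[],
)


def install_overrides() -> None:
    """Install torchao unless the flag or the runtime probe says Windows ROCm."""
    if installer._rocm_windows_torch_installed or installer._installed_torch_is_windows_rocm():
        return
    installer.installed.append("torchao")


def test_torchao_skipped_when_probe_reports_windows_rocm():
    installer.installed.clear()
    with mock.patch.object(
        installer, "_installed_torch_is_windows_rocm", return_value=True
    ):
        install_overrides()
    assert installer.installed == []


def test_torchao_installed_when_probe_reports_other_build():
    installer.installed.clear()
    install_overrides()
    assert installer.installed == ["torchao"]
```

Asserting on the observable outcome (what got installed) rather than on source text means the test keeps passing if the guard is refactored.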

Patch imported MLXTrainer and MLXTrainingConfig objects to preserve the expected dataclass field ordering and to provide a _train_dataset_for_batches fallback when older trainers or test doubles only expose train_dataset. Also add focused worker tests covering both compatibility paths.

chatgpt-codex-connector · 2026-07-03T12:51:48Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

Imagineer99 · 2026-07-03T18:24:15Z

I pushed the cleanup we discussed.

The PR is now scoped to the Windows ROCm torchao guard: the installer skips torchao when either the ROCm install marker is set or the installed torch runtime probes as Windows ROCm. I also removed the unrelated RL/MLX changes and switched the test coverage to exercise the behavioral skip path instead of relying on source-string assertions.

CI is green on the latest head, merging.

Ayushman Paul added 3 commits July 2, 2026 12:39

Guard Windows ROCm torchao override skip

63b8b4b

Detect installed ROCm torch directly before applying the torchao override so Windows ROCm environments never install the crashing torchao package even if the earlier ROCm-installed flag is missing.

gemini-code-assist Bot reviewed Jul 3, 2026

Comment thread studio/install_python_stack.py Outdated

Comment thread unsloth/models/rl.py

InfoSage05 and others added 7 commits July 3, 2026 14:20

Update unsloth/models/rl.py

4d0475d

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update studio/install_python_stack.py

f1091e8

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Merge upstream/main into issue-6833-windows-rocm-torchao

0a8b0cf

Merge remote PR updates into issue-6833-windows-rocm-torchao

e74e491

[pre-commit.ci] auto fixes from pre-commit.com hooks

1b12c69

for more information, see https://pre-commit.ci

Merge branch 'main' into issue-6833-windows-rocm-torchao

bf77875

InfoSage05 marked this pull request as ready for review July 3, 2026 12:51

InfoSage05 requested review from Datta0, danielhanchen and pluesclues as code owners July 3, 2026 12:51

Ayushman Paul and others added 7 commits July 3, 2026 12:52

Merge remote-tracking branch 'origin/issue-6833-windows-rocm-torchao'…

262f699

… into issue-6833-windows-rocm-torchao

[pre-commit.ci] auto fixes from pre-commit.com hooks

3705eb0

for more information, see https://pre-commit.ci

Merge branch 'main' into issue-6833-windows-rocm-torchao

a1d9c9c

Merge branch 'main' into issue-6833-windows-rocm-torchao

d3f4270

Scope PR to Windows ROCm torchao guard

eba4669

Restore PR scope to Windows ROCm guard

c177ce1

[pre-commit.ci] auto fixes from pre-commit.com hooks

f5f34ed

for more information, see https://pre-commit.ci

Imagineer99 mentioned this pull request Jul 3, 2026

PR #6837 staging CI Imagineer99/unsloth#35

Open

Imagineer99 and others added 3 commits July 3, 2026 18:49

test: cover Windows ROCm torchao skip behavior

1492dd0

[pre-commit.ci] auto fixes from pre-commit.com hooks

070e9b9

for more information, see https://pre-commit.ci

Merge branch 'main' into issue-6833-windows-rocm-torchao

9fed497

Merge branch 'main' into issue-6833-windows-rocm-torchao

a4d4967

Imagineer99 approved these changes Jul 3, 2026

Imagineer99 merged commit c356427 into unslothai:main Jul 3, 2026
36 of 37 checks passed

Imagineer99 merged 22 commits into unslothai:main from InfoSage05:issue-6833-windows-rocm-torchao

2 participants